privacy-filter: cap GPU memory + release cache to stop VRAM leak by lloydmak99 · Pull Request #51 · nearai/cvm-compose-files

lloydmak99 · 2026-05-29T23:10:06Z

Problem

privacy-filter (inline HF Transformers token-classification server in small-models.yaml, pipeline(..., device_map="auto"), per-request batch_size=32) has no GPU memory bound. Under steady traffic PyTorch's CUDA caching allocator ratchets its reserved memory up and never releases it, so the process slowly hoards the GPU it shares with Qwen3-VL, FLUX, embeddings, reranker and whisper (GPU 7).

Measured on 2026-05-29 (gpu11, H200 ~140 GB): recreating privacy-filter freed ~93 GB — for a model that needs ~1–2 GB.

Impact

As privacy-filter fills the card (free ~50 GB → ~0 over 1–2 days), the largest co-tenant Qwen3-VL (~49 GB at --gpu-memory-utilization 0.35) can no longer load and crash-loops with torch.AcceleratorError: CUDA error: out of memory. The same leak OOM'd embeddings/whisper on 2026-05-25 ("No available memory for cache blocks"). Affects both small-models hosts (gpu11 + gpu02 — identical config).

This is not a static GPU-budget misconfig of the small models, and not gemma (different GPUs): the vLLM/SGLang co-tenants hard-cap their VRAM, so the only unbounded consumer is the raw-HF privacy-filter.

How it was isolated

Recreate-and-watch (per-process nvidia-smi is unreachable — CVMs reject SSH, compose-manager has no exec): recreating FLUX freed only its ~22 GB static pool and Qwen3-VL kept crash-looping; recreating privacy-filter freed ~93 GB and Qwen3-VL recovered.

Fix

Inline server.py + container env:

- torch.cuda.empty_cache() after every request (core fix): returns cached-but-unused CUDA blocks to the driver so reserved memory stops ratcheting up.
- torch.cuda.set_per_process_memory_fraction(GPU_MEM_FRACTION, 0) (fail-safe): hard ceiling so the process self-OOMs/restarts instead of starving its neighbours. Default GPU_MEM_FRACTION=0.10 (~14 GB on a 140 GB H200), env-tunable without an image rebuild.
- torch.inference_mode() around inference: no autograd state retained across requests.
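
The memory figures quoted in this description can be cross-checked with a few lines of arithmetic. This is only a sketch: the inputs are the rounded values given above (not new measurements), and the helper name `fraction_to_gb` is illustrative.

```python
# Back-of-the-envelope check of the VRAM figures quoted in this PR.
# Inputs are the rounded values from the description, not measurements.

H200_TOTAL_GB = 140.0  # the "~140 GB" H200


def fraction_to_gb(fraction: float, total_gb: float = H200_TOTAL_GB) -> float:
    """Convert a memory fraction into GB on a card with total_gb of VRAM."""
    return fraction * total_gb


privacy_cap_gb = fraction_to_gb(0.10)  # GPU_MEM_FRACTION default
qwen_pool_gb = fraction_to_gb(0.35)    # --gpu-memory-utilization 0.35
leaked_gb = 93.0                       # freed by recreating privacy-filter

# Upper bound on what all other tenants could use in each situation.
left_during_leak_gb = H200_TOTAL_GB - leaked_gb
left_with_cap_gb = H200_TOTAL_GB - privacy_cap_gb

print(f"privacy-filter cap:        {privacy_cap_gb:.0f} GB")
print(f"Qwen3-VL pool:             {qwen_pool_gb:.0f} GB")
print(f"left for others (leaking): {left_during_leak_gb:.0f} GB")
print(f"left for others (capped):  {left_with_cap_gb:.0f} GB")
```

With ~93 GB held by privacy-filter, less than Qwen3-VL's ~49 GB pool remains on the card even before counting the other co-tenants, which is consistent with the crash-loop described under Impact.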

Validated: small-models.yaml parses and the embedded server.py compiles.

Deploy

Normal tag + redeploy of small-models.yaml to both hosts (POST :8080/compose/up with the new tag, services:["<privacy-filter container>"], force_recreate:true).
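
A rough sketch of how that redeploy call might be assembled. Only the `POST :8080/compose/up` route and the `services` / `force_recreate` fields come from this description; the `tag` key, the plain-HTTP scheme, the host name and the function name are assumptions for illustration, and the container name is left as the placeholder used above.

```python
import json
import urllib.request


def build_compose_up_request(host: str, tag: str, service: str,
                             port: int = 8080) -> urllib.request.Request:
    """Build (but do not send) the compose/up request for one service."""
    body = {
        "tag": tag,              # ASSUMPTION: key name for the new tag
        "services": [service],   # from the PR description
        "force_recreate": True,  # from the PR description
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/compose/up",  # ASSUMPTION: plain HTTP
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and tag; sending would be urllib.request.urlopen(req).
req = build_compose_up_request(
    "gpu02.example.internal", "example-tag", "<privacy-filter container>"
)
print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes it easy to inspect the payload once per host before touching either machine.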

⚠️ Interim: gpu11 was already mitigated by manually recreating the container (frees the leak but recurs in ~1–2 days). gpu02 still needs an immediate recreate until this ships. This PR makes the fix permanent.

Follow-up (optional)

If reserved-memory fragmentation still creeps, add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (left out here to avoid any interaction with the per-process fraction cap on torch 2.5.1).
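
If the option is ever set from Python rather than from the container env, it has to be in place before the CUDA allocator initializes, so the safest spot is before `import torch`. A minimal sketch, assuming the usual comma-separated `option:value` format of `PYTORCH_CUDA_ALLOC_CONF`; the helper name `configure_allocator` is hypothetical and not part of server.py.

```python
import os


def configure_allocator(env=os.environ) -> str:
    """Add expandable_segments:True unless an explicit setting already exists.

    Options already present in PYTORCH_CUDA_ALLOC_CONF are preserved.
    """
    current = env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    options = [opt.strip() for opt in current.split(",") if opt.strip()]
    if not any(opt.startswith("expandable_segments:") for opt in options):
        options.append("expandable_segments:True")
    env["PYTORCH_CUDA_ALLOC_CONF"] = ",".join(options)
    return env["PYTORCH_CUDA_ALLOC_CONF"]


# Intended use at the very top of the server module:
#   configure_allocator()
#   import torch  # imported only after the env var is in place
```

An operator-supplied `expandable_segments:False` is deliberately left alone, so the container env stays the single source of truth.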

privacy-filter is an inline HF Transformers token-classification server (`pipeline(..., device_map="auto")`) with no memory bound. Under steady traffic the CUDA caching allocator's reserved memory ratchets up and is never released, so the process slowly hoards the GPU it shares with Qwen3-VL, FLUX, embeddings, reranker and whisper (GPU 7). Observed ~93 GB held on an H200 for a model that needs ~1-2 GB.

As privacy-filter fills the card (free ~50 GB -> ~0 over 1-2 days) the largest co-tenant, Qwen3-VL (~49 GB at --gpu-memory-utilization 0.35), can no longer load and crash-loops with `torch.AcceleratorError: CUDA error: out of memory`. The same leak OOM'd embeddings/whisper on 2026-05-25. Hits both small-models hosts (gpu11, gpu02) since they run identical config.

Fix (inline server + container env):

- empty_cache() after every request (core fix): returns cached-but-unused CUDA blocks to the driver so reserved memory stops ratcheting.
- set_per_process_memory_fraction(GPU_MEM_FRACTION, 0) (fail-safe): hard ceiling so the process self-OOMs/restarts instead of starving neighbours. Default 0.10 (~14 GB on a 140 GB H200), env-tunable.
- torch.inference_mode() around inference: no autograd state retained.

Interim mitigation already applied by recreating the container, which frees the leaked VRAM but recurs in ~1-2 days; this makes it permanent. Ship via the normal tag + compose/up redeploy of small-models.yaml.

lloydmak99 · 2026-05-29T23:12:01Z

Tracking issue: nearai/infra#158

…_segments)

Addresses the code review of the first cut:

- Root cause now fixed at the source: PYTORCH_CUDA_ALLOC_CONF=expandable_segments lets the CUDA allocator shrink reserved segments instead of ratcheting up.
- Drop per-request torch.cuda.empty_cache(): a synchronizing cudaFree on the hot path stalled the shared GPU and the co-located models it was meant to protect. A 30s watchdog thread now releases idle blocks off the request path.
- Real fail-safe instead of a silent 500-storm: the watchdog hard-restarts the container (os._exit -> restart:unless-stopped) if this process's reserved VRAM exceeds GPU_MEM_LIMIT_GB, and an acute CUDA-OOM in a request also exits. The prior "self-OOMs and restarts" comment was false: a caught OOM returned 500 while the process stayed up behind a still-healthy /v1/models probe.
- Drop set_per_process_memory_fraction: the 0.10 (~14GB) guess could OOM legit batch_size=32 requests, and device_map="auto" planned against the full card and ignored the cap anyway. Bound the work via PRIVACY_BATCH_SIZE instead; inputs are NOT truncated (a privacy filter must see the whole text).
- device=0 instead of device_map="auto" (no accelerate planner mismatch).
- Drop torch.inference_mode(): redundant with the pipeline's internal no_grad and stricter (risked raising under trust_remote_code custom models).
- Tolerant env parsing + clamps so a malformed knob can't crash-loop boot.

Validated: small-models.yaml parses and the embedded server.py compiles.
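
The watchdog described in this commit could look roughly like the sketch below. It is not the code from server.py: the function name, the injected callables and the release-then-check ordering are assumptions, chosen so the loop can be exercised without a GPU. In the real server the callables would presumably wrap `torch.cuda.memory_reserved(0)` and `torch.cuda.empty_cache()`.

```python
import os
import threading


def start_watchdog(read_reserved_gb, release_idle, limit_gb,
                   interval_s=30.0, hard_exit=os._exit, stop_event=None):
    """Run a daemon thread that frees idle VRAM and enforces a hard limit.

    read_reserved_gb: callable returning this process's reserved VRAM in GB.
    release_idle:     callable returning cached-but-unused blocks to the driver.
    hard_exit:        called with exit code 1 when the limit is exceeded;
                      os._exit lets the container restart policy take over.
    """
    stop_event = stop_event or threading.Event()

    def loop():
        # Event.wait doubles as an interruptible sleep; True means "stop".
        while not stop_event.wait(interval_s):
            release_idle()  # off the request path (ASSUMED to run first)
            if read_reserved_gb() > limit_gb:
                hard_exit(1)
                return  # only reached when hard_exit is a test double

    thread = threading.Thread(target=loop, name="vram-watchdog", daemon=True)
    thread.start()
    return thread, stop_event


# Intended wiring (requires torch and a CUDA device, so left as a comment):
#   start_watchdog(
#       read_reserved_gb=lambda: torch.cuda.memory_reserved(0) / 2**30,
#       release_idle=torch.cuda.empty_cache,
#       limit_gb=GPU_MEM_LIMIT_GB,
#   )
```

Releasing before checking means the limit only trips on memory that could not be handed back, which matches the intent of a fail-safe rather than a routine recycle.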

lloydmak99 · 2026-05-29T23:47:38Z

Revised in 7295a65 to address review:

- Per-request empty_cache() removed: it was a synchronizing cudaFree on the hot path that would stall the shared GPU and the very models this protects. A 30s watchdog thread now releases idle blocks off the request path.
- Root cause fixed at the source: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True lets the allocator shrink reserved segments instead of ratcheting.
- Real fail-safe: the "self-OOMs and restarts" claim was false (a caught CUDA OOM returned 500 while the process stayed up behind a healthy /v1/models probe). Now the watchdog calls os._exit if reserved VRAM exceeds GPU_MEM_LIMIT_GB, and an acute request OOM also exits, so restart: unless-stopped actually recycles the container.
- Dropped set_per_process_memory_fraction: the 0.10 (~14 GB) guess could OOM legit batch_size=32 requests, and device_map="auto" planned against the full card and ignored the cap. Bound the work via PRIVACY_BATCH_SIZE instead; inputs are not truncated (a privacy filter must see the whole text).
- device=0 instead of device_map="auto"; dropped redundant/stricter torch.inference_mode(); tolerant env parsing so a bad knob can't crash-loop boot.
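
One way the tolerant env parsing could be written is sketched below. The helper name `env_number` and every default and bound shown are illustrative assumptions; the PR text names the knobs (PRIVACY_BATCH_SIZE, GPU_MEM_LIMIT_GB) but does not state their defaults or limits.

```python
import os


def env_number(name, default, lo, hi, cast=float, env=os.environ):
    """Read a numeric knob from the environment without ever raising.

    Missing, unparsable or NaN values fall back to `default`;
    anything else is clamped into [lo, hi].
    """
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = cast(raw.strip())
    except (TypeError, ValueError):
        return default
    if value != value:  # NaN is the only value not equal to itself
        return default
    return max(lo, min(hi, value))


# Hypothetical defaults and bounds, for illustration only.
batch_size = env_number("PRIVACY_BATCH_SIZE", 32, 1, 256, cast=int)
mem_limit_gb = env_number("GPU_MEM_LIMIT_GB", 16.0, 1.0, 140.0)
```

Falling back to a default instead of raising keeps a typo in the compose file from turning into a boot-time crash loop, which is the failure mode the revision set out to avoid.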

Validated: YAML parses and the embedded server.py compiles.

lloydmak99 requested a review from Evrard-Nil May 29, 2026 23:13

lloydmak99 wants to merge 2 commits into main from fix/privacy-filter-gpu-leak
